Generating synthesized voice and instrumental sound

ABSTRACT

There is provided a synthesized sound generating apparatus and method which can achieve responsive and high-quality speech synthesis based on a real-time convolution operation. Coefficients are generated by using dynamic cutting to extract characteristic information from a first signal. A convolution operation is performed on a second signal using the generated coefficients to generate a synthesized signal. As part of the convolution operation, an interpolation process is performed on the coefficients to prevent a rapid change in level of the generated synthesized signal upon switching of the coefficients.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a synthesized sound generating apparatus and method which are suitable for inputting and synthesizing voices and instrumental sounds and outputting synthesized instrumental sounds or the like having characteristic information on the voices.

2. Prior Art

Vocoders, which have a function for analyzing and synthesizing voices, are commonly used with music synthesizers due to their ability to onomatopoeically generate instrumental sounds, noise, or the like. Major known vocoders include formant vocoders, linear predictive analysis and synthesis systems (PARCOR analysis and synthesis), cepstrum vocoders (speech synthesis based on homomorphic filtering), channel vocoders (so-called Dudley vocoders), and the like.

The formant vocoder uses a terminal analog synthesizer to carry out sound synthesis based on parameters for vocal tract characteristics determined from a formant and an anti-formant of a spectral envelope, that is, pole and zero points thereof. The terminal analog synthesizer is comprised of a plurality of resonance circuits and antiresonance circuits arranged in cascade connection for simulating resonance/antiresonance characteristics of a vocal tract. The linear predictive analysis and synthesis system is an extension of the predictive encoding method, which is most popular among the speech synthesis methods. The PARCOR analysis and synthesis system is an improved version of the linear predictive analysis and synthesis system. The cepstrum vocoder is a speech synthesis system using a logarithmic amplitude characteristic of a filter and inverse Fourier transformation and inverse convolution of a logarithmic spectrum of a sound source.

The channel vocoder uses bandpass filters 10-1 to 10-N for different bands to extract spectral envelope information on an input speech signal, that is, parameters for the vocal tract characteristics, as shown in FIG. 1, for example. On the other hand, a pulse train generator 21 and a noise generator 22 generate two kinds of sound source signals, which are amplitude-modulated using the spectral envelope parameters. This amplitude modulation is carried out by multipliers (modulators) 30-1 to 30-N. Modulated signals output from the multipliers (modulators) 30-1 to 30-N pass through bandpass filters 40-1 to 40-N and are then added together by an adder 50 whereby a synthesized speech signal is generated and output.

In the example of the channel vocoder disclosed in Japanese Laid-Open Patent Publication (Kokai) No. 05-204397, outputs from the bandpass filters 10-1 to 10-N are rectified and smoothed when passing through short-time average-amplitude detection circuits 60-1 to 60-N. A voice sound/unvoiced sound detector 71 determines a voice sound component and an unvoiced sound component of the input speech signal, and upon detecting the voice sound component, the detector 71 operates a switch 23 so as to select and deliver an output (pulse train) from the pulse train generator 21 to the multipliers 30-1 to 30-N. In addition, upon detecting the unvoiced sound component, the voice sound/unvoiced sound detector 71 operates the switch 23 so as to select and deliver an output (noise) from the noise generator 22 to the multipliers 30-1 to 30-N. At the same time, a pitch detector 72 detects a pitch of the input speech signal to cause it to be reflected in the output pulse train from the pulse train generator 21. Thus, when the voice sound component is detected, the output from the pulse train generator 21 contains pitch information, which is among characteristic information on the input speech signal.

According to the above described formant vocoder, however, since the formant and anti-formant from the spectral envelope cannot be easily extracted, the formant vocoder requires a complicated analysis process or manual operation. The linear predictive analysis and synthesis system uses an all-pole model to generate sounds and uses a simple mean square value of prediction errors, as an evaluative reference for determining coefficients for the model. Thus, this method does not focus on the nature of voices. The cepstrum vocoder requires a large amount of time for spectral processing and Fourier transformation and is thus insufficiently responsive in real time.

On the other hand, the channel vocoder directly expresses the parameters for the vocal tract characteristics as physical quantities in the frequency domain and thus takes the nature of voices into consideration. Due to its lack of mathematical rigor, however, the channel vocoder is not suited for digital processing.

SUMMARY OF THE INVENTION

There is provided a synthesized sound generating apparatus and method which can achieve responsive and high-quality speech synthesis based on a real-time convolution operation. Coefficients are generated by using dynamic cutting to extract characteristic information from a first signal. A convolution operation in the time domain is performed on a second signal using the generated coefficients to generate a synthesized signal. An interpolation process is performed on the coefficients to prevent a rapid change in level of the generated synthesized signal upon switching of the coefficients.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing an example of a conventional vocoder;

FIG. 2 is a block diagram showing the construction of a synthesized sound generating apparatus according to an embodiment of the present invention;

FIG. 3 is a view useful in explaining a convolution operation;

FIG. 4 is a waveform diagram useful in explaining a manner of dynamically cutting out waveforms used as coefficients;

FIG. 5A is a waveform diagram useful in explaining a manner of coefficient interpolation carried out in switching from a coefficient A to a coefficient B;

FIG. 5B is a waveform diagram useful in explaining a manner of coefficient interpolation carried out in switching from the coefficient A to a coefficient B′;

FIG. 6 is a block diagram showing the construction of a synthesized sound generating apparatus according to another embodiment of the present invention; and

FIG. 7 is a diagram useful in explaining a cross fade process.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present invention will be described below in detail with reference to the drawings showing preferred embodiments thereof.

FIG. 2 is a block diagram showing the construction of a synthesized sound generating apparatus according to an embodiment of the present invention. In this embodiment, the synthesized sound generating apparatus according to the present invention is applied to a vocoder to generate a synthesized signal by dynamically cutting out waveforms from an analog speech signal (a first signal) input from a microphone or the like, extracting characteristic information therefrom to generate coefficients, and convoluting the generated coefficients into an analog instrumental sound signal (or a music signal) (a second signal) from an electric guitar, a synthesizer, or the like.

The input analog speech signal is converted into a digital value (digital speech signal) by an AD converter 1-1. At the same time, an input analog instrumental-sound signal is converted into a digital value (digital instrumental-sound signal) by an AD converter 1-2. Outputs from the AD converters 1-1, 1-2 are processed by digital signal processors (DSP) 2-1, 2-2, respectively.

The digital signal processor 2-1 subjects the digital speech signal from the AD converter 1-1 to sound pressure control and sound quality correction, and cuts out sound waveforms from the speech signal at predetermined time intervals of, for example, 10 to 20 ms to generate coefficients h, which are transmitted to a convolution circuit (CNV) 3. The digital signal processor 2-2 subjects the digital instrumental-sound signal to sound pressure control and sound quality correction to supply the processed signal to the convolution circuit 3 as data.

The sound pressure control by the digital signal processors 2-1, 2-2 comprises correcting and controlling, for example, the sound pressure level (dynamic range), and the sound quality correction comprises correcting the frequency characteristic. Further, the sound pressure control includes creating sound characters. In addition, low-frequency noise from the microphone is cut off.

The convolution circuit 3 performs a convolution operation based on the coefficients h output from the digital signal processor 2-1 and the data output from the digital signal processor 2-2. The coefficients are updated at the same time intervals (cycle) as those at which the sound waveforms are cut out, that is, every 10 to 20 ms.

The convolution circuit 3 executes the convolution operation in a manner such as one shown in FIG. 3. That is, an input x(n), which is output data from the digital signal processor 2-2, is sequentially delayed by one-sample delay devices D1 to DN-1. Then, multipliers M0 to MN-1 multiply the input x(n) and signals x(n−1) to x(n−N+1) obtained by delaying the input x(n), by the coefficients h(0) to h(N−1) output from the digital signal processor 2-1, respectively. Outputs from the multipliers M0 to MN-1 are sequentially added together by adders A1 to AN-1, to obtain an output y(n).

Thus, the output y(n) is expressed by Equation 1 given below:

$y(n) = \sum_{i=0}^{N-1} h(i)\,x(n-i)$
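Equation 1 can be illustrated by the following minimal Python sketch. It is given only as an aid to understanding and is not the circuit of FIG. 3 itself; the function name `fir_convolve` and the assumption that all samples before n = 0 are zero are introduced here for illustration and do not appear in the description above.

```python
def fir_convolve(h, x):
    """Direct evaluation of y(n) = sum_{i=0}^{N-1} h(i) * x(n - i).

    Assumption of this sketch: samples of x before n = 0 are zero,
    i.e. the delay devices start out empty.
    """
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(n_taps):
            if n - i >= 0:  # skip taps that would read before the input starts
                acc += h[i] * x[n - i]
        y.append(acc)
    return y


# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir_convolve([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0])
```

Each pass of the inner loop corresponds to one multiplier and adder pair in FIG. 3, and the index n − i corresponds to the output of the i-th delay device.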

This convolution operation is realized by a well-known FIR (finite impulse response) filter. With a small filter length, the filter acts as an equalizer to carry out a frequency characteristic-correcting function, whereas with a large filter length, the filter can execute signal processing called reverberation. In common convolution operations, the coefficients h are fixed, but in the present invention these coefficients are varied. Specifically, in the present invention waveforms of the speech signals cut out at the short time intervals as described above are used as the coefficients. The coefficients are automatically updated in response to the sequentially varying speech signal. The instrumental sound signal thus convoluted with the coefficients as described above is similar to those obtained through processing by the conventional vocoders.
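The difference between a fixed-coefficient FIR filter and one whose coefficients are varied can be sketched as follows. This is a hypothetical illustration only: the class name `VaryingFir`, its methods, and the choice to zero-pad or truncate each new coefficient set to a fixed number of taps are assumptions of the sketch, not features stated in the description.

```python
from collections import deque


class VaryingFir:
    """Sample-by-sample FIR filter whose coefficients may be replaced at any time."""

    def __init__(self, n_taps):
        self.h = [0.0] * n_taps
        # delay[i] holds x(n - i); the oldest sample falls off the right end.
        self.delay = deque([0.0] * n_taps, maxlen=n_taps)

    def set_coefficients(self, h):
        # Assumption: a cut-out waveform shorter than the filter is
        # zero-padded, and a longer one is truncated.
        n = len(self.h)
        self.h = (list(h) + [0.0] * n)[:n]

    def process(self, sample):
        self.delay.appendleft(sample)
        return sum(hi * xi for hi, xi in zip(self.h, self.delay))


fir = VaryingFir(2)
fir.set_coefficients([1.0, 0.0])          # passes the current sample through
first = [fir.process(s) for s in [1.0, 2.0]]
fir.set_coefficients([0.0, 1.0])          # now outputs the previous sample
second = fir.process(3.0)
```

In this sketch the delay line is kept when the coefficients change, so only the weighting of the stored instrumental-sound samples is altered by each newly cut-out speech waveform.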

The coefficient switching cycle is preferably between 10 and 20 ms for both men and women. The waveform cutting-out with a fixed cycle, however, results in clip noise or distortion in the signal, which is aurally sensed. To avoid this, the digital signal processor 2-1 obtains the coefficients h used for the convolution operation by dynamically cutting out waveforms in such a manner that each waveform starts at a zero cross point and ends at another zero cross point separated from the first one by a time interval which is close to a reference switching cycle Δt.

For example, if the input speech signal varies as shown in FIG. 4 and when waveforms W1 and W2 are cut out with the fixed switching cycle Δt, there is a high probability that the start and end points of each waveform do not coincide with zero cross points P1, P2, . . . , and P6. Thus, the digital signal processor 2-1 dynamically varies the cutting-out cycle. Specifically, the waveform cutting-out is executed by determining from actual waveforms, time intervals Δt−α, Δt−β, Δt−α′, and Δt+β′, each corresponding to a section between two zero cross points which is close to the fixed switching cycle Δt.
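The dynamic cutting-out described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the processing actually performed by the digital signal processor 2-1: the function names `zero_crossings` and `cut_segments`, the definition of a zero cross point as a sign change between adjacent samples, and the expression of the reference switching cycle Δt as a sample count `ref_len` are all introduced here for the example.

```python
def zero_crossings(signal):
    """Return every index k at which the sign changes between k-1 and k."""
    points = []
    for k in range(1, len(signal)):
        rising = signal[k - 1] < 0.0 <= signal[k]
        falling = signal[k - 1] >= 0.0 > signal[k]
        if rising or falling:
            points.append(k)
    return points


def cut_segments(signal, ref_len):
    """Cut consecutive (start, end) sections between zero cross points.

    Each section ends at the zero cross point whose distance from the
    section start is closest to ref_len samples (the reference cycle).
    """
    points = zero_crossings(signal)
    segments = []
    if not points:
        return segments
    start = points[0]
    while True:
        later = [p for p in points if p > start]
        if not later:
            break
        end = min(later, key=lambda p: abs((p - start) - ref_len))
        segments.append((start, end))
        start = end  # the next section begins where this one ended
    return segments


# Toy waveform whose sign flips at samples 3, 5, 9, 12 and 14.
wave = [1.0] * 3 + [-1.0] * 2 + [1.0] * 4 + [-1.0] * 3 + [1.0] * 2 + [-1.0] * 4
sections = cut_segments(wave, 5)
```

With a reference length of 5 samples, the sketch cuts sections of 6 and 5 samples, which corresponds to intervals of the form Δt plus or minus a small offset in the text above.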

A similar technique is known from a sound waveform cutting-out device used in a speech synthesis apparatus proposed by Japanese Laid-Open Patent Publication (Kokai) No. 7-129196. The object of this patent, however, is to generate waveforms for one pitch and is not directed to the convolution coefficients for vocoders. The pitch information is not so important to the vocoder according to the present invention because it updates the coefficients through interpolation.

Even if the dynamically cut-out coefficients are used for the convolution operation as described above, if a coefficient A has a waveform passing through zero cross points as shown in FIGS. 5A and 5B, the waveform of the actually output synthesized signal undergoes a rapid change in level when the coefficient A is instantaneously switched to the next coefficient B. This may also result in clip noise or distortion, which is aurally sensed. To avoid such a rapid change in level, the convolution circuit 3 in FIG. 2 slowly switches from the coefficient A to the next coefficient B′ by executing an interpolation over a period of time substantially equal to the cutting-out interval, as shown in FIG. 5B. This solves the noise or distortion problem.

Various interpolation operation methods may be applied to the above interpolation, among which linear interpolation is the simplest. According to the linear interpolation, if the interpolation time is denoted by c [ms], the initial coefficient value by a, and the final coefficient value by b, then the coefficient value obtained at a time x=t [ms] after the start of the interpolation is f(x)=(b−a)/c*x+a when x≤c and f(x)=b when x>c. In fact, a new final coefficient value is set when x=c, to start a new coefficient interpolation.
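The linear interpolation formula above can be written out as the following Python sketch. The function names `interpolate_coefficient` and `interpolate_taps` are hypothetical, and applying the formula independently to each tap, with the shorter coefficient set padded with zeros, is an assumption of this sketch; the description does not state how coefficient sets of different lengths are aligned.

```python
def interpolate_coefficient(a, b, c, x):
    """f(x) = (b - a) / c * x + a for x <= c, and f(x) = b for x > c."""
    if x > c:
        return b
    return (b - a) / c * x + a


def interpolate_taps(h_a, h_b, c, x):
    """Interpolate every tap from coefficient set h_a towards h_b.

    Assumption: the shorter set is zero-padded to the longer length.
    """
    n = max(len(h_a), len(h_b))
    pad_a = list(h_a) + [0.0] * (n - len(h_a))
    pad_b = list(h_b) + [0.0] * (n - len(h_b))
    return [interpolate_coefficient(a, b, c, x) for a, b in zip(pad_a, pad_b)]


# Halfway through an 8 ms interpolation from a = 2 to b = 4.
midpoint = interpolate_coefficient(2.0, 4.0, 8.0, 4.0)
```

At x = 0 the function returns the initial value a, and at x = c it reaches the final value b, at which point a new final value would be set as described above.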

The coefficients generated by the digital signal processor 2-1 through the above described processing are stored in a memory (RAM) 4. The coefficients are then supplied to the convolution circuit 3 under the control of a CPU 5. An output from the convolution circuit 3 is imparted with effects such as sound quality correction and echoes by a digital signal processing circuit 6, and is then converted back into an analog signal by a D/A converter 7 to be output as a synthesized speech signal.

FIG. 6 shows the construction of a synthesized sound generating apparatus (vocoder) according to another embodiment of the present invention. In the synthesized sound generating apparatus according to the present embodiment, two convolution circuits 3-1, 3-2 are arranged in parallel to carry out a cross fade interpolation process. That is, the two convolution circuits 3-1, 3-2 do not have such an interpolation function as is provided by the convolution circuit 3 in FIG. 2, and are each comprised of an inexpensive LSI.

Similarly to the synthesized sound generating apparatus in FIG. 2, the AD converter 1-1 converts an input analog speech signal into a digital value (digital speech signal). At the same time, the AD converter 1-2 converts an input analog instrumental sound signal into a digital value (digital instrumental sound signal). The digital signal processor 2-1 subjects the digital speech signal from the AD converter 1-1 to sound pressure control and sound quality correction, and cuts out sound waveforms from the speech signal at predetermined time intervals of, for example, 10 to 20 ms to generate the coefficients h, which are transmitted to the convolution circuits (CNV) 3-1 and 3-2. The digital signal processor 2-2 subjects the digital instrumental sound signal to sound pressure control and sound quality correction to supply the processed signal to the convolution circuits 3-1 and 3-2 as data.

The coefficients generated by the digital signal processor 2-1 are temporarily stored in the RAM 4. The coefficients are then supplied to the convolution circuits 3-1 and 3-2 under the control of the CPU 5. The convolution circuits 3-1 and 3-2 each execute a convolution operation based on the coefficients from the digital signal processor 2-1 and the data from the digital signal processor 2-2. Outputs from the convolution circuits 3-1, 3-2 are imparted with effects such as sound quality correction and echoes by the digital signal processing circuit 6, and are then converted back into an analog signal by the D/A converter 7 to be output as a synthesized speech signal. In the present embodiment, the digital signal processing circuit 6 carries out a cross fade process, in contrast to the configuration in FIG. 2.

The cross fade process executed by the digital signal processing circuit 6 is shown in FIG. 7. That is, the output CNV1 from the first convolution circuit 3-1 and the output CNV2 from the second convolution circuit 3-2 are caused to partly overlap on the time axis and cross each other in such a manner that the latter half of the preceding output is faded out while the former half of the following output is simultaneously faded in, thereby reducing noise which may occur if the coefficients are instantaneously switched. For example, when the latter half B of the output CNV1 is faded out, the former half C of the output CNV2 is simultaneously faded in. Next, when the latter half D of the output CNV2 is faded out, the former half E of the next output CNV1 is simultaneously faded in. In the illustrated example, the length of the section over which the outputs CNV1 and CNV2 overlap each other is made equal to the dynamically varying switching cycle Δt, previously described with reference to FIG. 4. Therefore, the required length of each waveform cut out by the digital signal processor 2-1 in FIG. 6 is essentially at least twice as large as that in the configuration in FIG. 2.
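The cross fade of one overlapping section can be sketched in Python as follows. The function name `cross_fade` is hypothetical, and the use of linear gain ramps that always sum to one is an assumption of this sketch; the description above does not specify the shape of the fade curves.

```python
def cross_fade(fade_out_part, fade_in_part):
    """Mix the latter half of the preceding output with the former half
    of the following output over one overlapping section.

    Assumption: linear ramps, gain_in = k / n and gain_out = 1 - gain_in.
    """
    if len(fade_out_part) != len(fade_in_part):
        raise ValueError("overlapping sections must have equal length")
    n = len(fade_out_part)
    mixed = []
    for k in range(n):
        gain_in = k / n
        gain_out = 1.0 - gain_in
        mixed.append(gain_out * fade_out_part[k] + gain_in * fade_in_part[k])
    return mixed


# Latter half of CNV1 (constant 4.0) fading into a silent former half of CNV2.
section = cross_fade([4.0] * 4, [0.0] * 4)
```

Because the two gains sum to one at every sample, two identical inputs pass through this sketch unchanged, so the fade itself introduces no level jump.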

Therefore, it is an object of the present invention to provide a synthesized sound generating apparatus and method which can achieve responsive and high-quality speech synthesis based on a real-time convolution operation.

To attain the above object, according to a first aspect of the present invention, there is provided a synthesized sound generating apparatus comprising a coefficient generating device that generates coefficients by using dynamic cutting to extract characteristic information from a first signal; and a synthesized signal generating device that carries out a convolution operation on a second signal using the coefficients generated by the coefficient generating device to generate a synthesized signal.

In a preferred embodiment of the first aspect, the synthesized signal generating device comprises a convolution circuit that carries out an interpolation process on the coefficients to prevent a rapid change in level of the generated synthesized signal upon switching of the coefficients.

In a typical example of the first aspect, the first signal is a speech signal, and the characteristic information extracted from the speech signal indicates one waveform starting at a zero cross point and ending at another zero cross point separated from the zero cross point by a time interval close to a reference switching cycle.

Preferably, the time interval is determined from an actual waveform of the speech signal.

In a typical example of the first aspect, the second signal is an instrumental sound signal.

To attain the above object, according to a second aspect of the present invention, there is provided a synthesized signal generating apparatus comprising a coefficient generating device that dynamically continuously cuts out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients, a pair of convolution circuits that are operative in parallel, the convolution circuits alternately receiving the coefficients generated from the waveforms continuously cut out by the coefficient generating device and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal, respectively, and a cross fade processing device that carries out a cross fade process on the first synthesized signal and the second synthesized signal generated by the pair of convolution circuits, upon switching of the coefficients.

In a typical example of the second aspect, the first signal is a speech signal, and the characteristic information extracted from the speech signal indicates one waveform starting at a zero cross point and ending at another zero cross point separated from the zero cross point by a time interval close to a reference switching cycle.

Preferably, the time interval is determined from an actual waveform of the speech signal.

In a typical example of the second aspect, the second signal is an instrumental sound signal.

To attain the above object, according to a third aspect of the present invention, there is provided a synthesized sound generating method comprising a coefficient generating step of generating coefficients by using dynamic cutting to extract characteristic information from a first signal, and a synthesized signal generating step of carrying out a convolution operation on a second signal using the coefficients generated in the coefficient generating step to generate a synthesized signal.

To attain the above object, according to a fourth aspect of the present invention, there is provided a synthesized signal generating method comprising a coefficient generating step of dynamically continuously cutting out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients, a convolution step of alternately receiving the coefficients generated from the waveforms continuously cut out by the coefficient generating step and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal, and a cross fade processing step of carrying out a cross fade process on the first synthesized signal and the second synthesized signal generated by the convolution step, upon switching of the coefficients.

To attain the above object, the present invention further provides a synthesized sound generating apparatus comprising a coefficient generating means for generating coefficients by using dynamic cutting to extract characteristic information from a first signal, and a synthesized signal generating means for carrying out a convolution operation on a second signal using the coefficients generated by the coefficient generating means to generate a synthesized signal.

To attain the above object, the present invention also provides a synthesized signal generating apparatus comprising a coefficient generating means for dynamically continuously cutting out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients, a convolution means for alternately receiving the coefficients generated from the waveforms continuously cut out by the coefficient generating means and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal, and a cross fade processing means for carrying out a cross fade process on the first synthesized signal and the second synthesized signal generated by the convolution means, upon switching of the coefficients.

According to the present invention, a real-time convolution operation can be realized to achieve responsive and high-quality speech synthesis. According to the present invention, it is unnecessary to distinguish between the voice sound component and unvoiced sound component of the input speech signal as in the conventional channel vocoder. Further, the present invention can reduce the size of the circuit. The present invention is not limited to speech signals and can accommodate various input signals.

The above and other objects of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings.

What is claimed is:
 1. A synthesized sound generating apparatus comprising: a coefficient generating device that generates coefficients by using dynamic continuous cutting to extract characteristic information from a first signal; and a synthesized signal generating device that carries out a time domain convolution operation on a second signal using the coefficients generated by said coefficient generating device to generate a synthesized signal, wherein said synthesized signal generating device includes a convolution circuit that carries out an interpolation process between a present coefficient and a coefficient generated immediately next to said present coefficient of said coefficients to prevent a rapid change in a level of the generated synthesized signal upon switching of said coefficients.
 2. A synthesized signal generating apparatus according to claim 1, wherein said convolution circuit carries out said interpolation process over a period of time substantially equal to a period of time over which said dynamic continuous cutting is used by said coefficient generating device.
 3. A synthesized signal generating apparatus according to claim 1, wherein said first signal is a speech signal, and said characteristic information extracted from said speech signal indicates one waveform starting at a zero cross point and ending at another zero cross point separated from said zero cross point by a time interval close to a reference switching cycle.
 4. A synthesized signal generating apparatus according to claim 3, wherein said time interval is determined from an actual waveform of said speech signal.
 5. A synthesized signal generating apparatus according to claim 3, wherein said second signal is an instrumental sound signal.
 6. A synthesized signal generating apparatus comprising: a coefficient generating device that dynamically continuously cuts out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients; a pair of convolution circuits that are operative in parallel, said convolution circuits alternately receiving said coefficients generated from said waveforms continuously cut out by said coefficient generating device and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal, respectively; and a cross fade processing device that carries out a cross fade process on said first synthesized signal and said second synthesized signal generated by said pair of convolution circuits, upon switching of said coefficients.
 7. A synthesized signal generating apparatus according to claim 6, wherein said first signal is a speech signal, and said characteristic information extracted from said speech signal indicates one waveform starting at a zero cross point and ending at another zero cross point separated from said zero cross point by a time interval close to a reference switching cycle.
 8. A synthesized signal generating apparatus according to claim 7, wherein said second signal is an instrumental sound signal.
 9. A synthesized signal generating apparatus according to claim 7, wherein said time interval is determined from an actual waveform of said speech signal.
 10. A synthesized sound generating method comprising: generating coefficients by using dynamic continuous cutting to extract characteristic information from a first signal; and carrying out a time domain convolution operation on a second signal using the generated coefficients to generate a synthesized signal, wherein in said carrying out step, an interpolation process is carried out between a present coefficient and a coefficient generated immediately next to said present coefficient of said coefficients to prevent a rapid change in a level of the generated synthesized signal upon switching of said coefficients.
 11. A synthesized signal generating method comprising: a coefficient generating step of dynamically continuously cutting out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients; a convolution step of alternately receiving said coefficients generated from said waveforms continuously cut out by said coefficient generating step and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal; and a cross fade processing step of carrying out a cross fade process on said first synthesized signal and said second synthesized signal generated by said convolution step, upon switching of said coefficients.
 12. A synthesized sound generating apparatus comprising: a coefficient generating means for generating coefficients by using dynamic continuous cutting to extract characteristic information from a first signal; and a synthesized signal generating means for carrying out a convolution operation on a second signal using the coefficients generated by said coefficient generating means to generate a synthesized signal, wherein said synthesized signal generating means includes a convolution circuit that carries out an interpolation process between a present coefficient and a coefficient generated immediately next to said present coefficient of said coefficients to prevent a rapid change in a level of the generated synthesized signal upon switching of said coefficients.
 13. A synthesized signal generating apparatus comprising: a coefficient generating means for dynamically continuously cutting out waveforms from a first signal in a manner such that adjacent ones of the waveforms cut out from the first signal partly overlap each other, to extract characteristic information therefrom to generate coefficients; a convolution means for alternately receiving said coefficients generated from said waveforms continuously cut out by said coefficient generating means and carrying out convolution operations on a second signal using the coefficients to generate a first synthesized signal and a second synthesized signal; and a cross fade processing means for carrying out a cross fade process on said first synthesized signal and said second synthesized signal generated by said